Abstract

This paper develops methods of distributed Bayesian hypothesis tests
for fault detection and diagnosis that are based on belief propagation
and optimization in graphical models. The main challenges in developing
distributed statistical estimation algorithms are i) difficulties in ensuring
convergence and consensus for solutions of distributed inference problems,
ii) increasing computational costs due to lack of scalability, and iii)
communication constraints for networked multi-agent systems. To cope with
these challenges, this manuscript considers i) belief propagation and
optimization in graphical models of complex distributed systems, ii)
decomposition methods of optimization for parallel and iterative computations,
and iii) distributed decision-making protocols.

1 Introduction 

Stochastic inference using graphical models [7,21] has been an important
research topic in a variety of disciplines, including signal processing [20],
machine learning [10], and artificial intelligence [18]. For the use of graphical
models in statistical inference problems, the optimal fusion of information and/or
data over networked agents, which are individual decision makers or processors,
and the design of compromised inference methods for distributed decision makers
are of significant importance.

A monumental work of Pearl [18], called belief propagation (BP), is a
message-passing algorithm in which local evidence is exchanged as messages that
are used to update local beliefs and to find fixed points of the iterations,
corresponding to marginal probability distributions of the node states. In a
standard BP method for statistical inference in a graphical model, agents on
the nodes exchange messages with neighboring agents connected over the edges.
The BP algorithm is known to provide exact marginal distributions when the
graphical model is tree-structured, i.e., has no cyclic loops [18]. In the
presence of cyclic loops in a graphical model, neither convergence nor
optimality of BP methods can, in general, be guaranteed, although some empirical
studies on the performance of loopy BP [14] and on conversion to equivalent
cycle-free graphical models [6] are available.

The main challenges in the development of BP algorithms with general
Markov and Bayesian graphical models are

i) Convergence Analysis: As previously mentioned, message-passing
algorithms of BP do not generally converge to a fixed point in the presence
of cyclic loops.

ii) Scalability: In a tree-structured graphical model, BP algorithms can find
a fixed point in O(n) iterations, where n is the diameter of the graph.
However, calculation of posterior marginal probabilities on nodes in an
arbitrary Bayesian network is known to be NP-hard [5,19], and even an
approximate computation of posterior marginal probabilities is NP-hard [9].

iii) Communication Constraints: Message passing or information exchange
over communication networks is not necessarily reliable, and communication
bandwidth and energy constraints are typical sources of degraded
performance of networked inference algorithms [3].

To cope with the aforementioned difficulties confronting BP methods for
statistical inference in graphical models, we consider

i) Belief Optimization: In [25], it was shown that BP fixed points
correspond to the stationary points of the Bethe free energy approximation
for a factor graph. The associated constrained minimization is called belief
optimization (BO). Our statistical inference methods are based on the
same principle: the joint probability distribution of the node states
in a graphical model is a minimizer of the free energy, and the beliefs,
corresponding to marginal probabilities of the node states, can be computed
by minimizing an approximate free energy such as the mean field or Bethe
free energy. The resultant statistical inference problems are given as
constrained minimizations.

ii) Decomposition Methods of Optimization: Belief optimization is a
large-scale constrained minimization that becomes intractable and
non-scalable as the number of nodes and the cardinality of the node states
increase. Since the couplings between the marginal probabilities to be
determined are constrained on the edges of the graphical model, a natural way
of reducing the computational demand is to use decomposition methods for
optimization.

iii) Distributed Decision Processes: In the presence of communication
constraints, decision processes and information exchange need to be
localized and distributed for reliable statistical inference over graphical
models.

Our main applications of BP/BO methods are distributed hypothesis tests
for fault detection and diagnosis (FDD) in large-scale distributed dynamical
systems. Developing automatic monitoring, detection, and diagnosis of system
faults is of rapidly growing importance as the size and complexity of systems
increase. Most existing methods for model-based FDD are centralized schemes
in the sense that a central decision maker can access all measurements and
the decision goal is to decide whether faults occur and to determine the types
and locations of faults. Distributed FDD is suitable for large-scale
interconnected and networked dynamical systems such as multi-agent systems and
power grids. Furthermore, since not all measurements are accessible to local
processors and computation nodes, centralized FDD schemes may not be applicable
to distributed systems. Belief propagation and optimization provide naturally
suitable ways of distributed statistical inference and decision making:
graphical models represent the interconnections and networks of local sensors
(measurements) and processors (data/information processing), and belief
consensus constraints are satisfied by exchanging messages for BP
and by imposing public variable constraints for BO.


2 Belief Propagation in Graphical Models 

BP algorithms are developed for graphical models. This section provides a
concise discussion of graphical representations and the corresponding BP methods
for distributed inference problems. Two types of graphical models are used to
represent probabilistic and informational dependencies of random variables:
Markov networks and Bayesian networks. A Markov network is
defined with an undirected graph whose nodes correspond to random variables
and whose edges correspond to their probabilistic and informational dependencies.
A Bayesian network is defined with a directed graph whose nodes correspond
to random variables and whose arrows denote causality constraints or
class-property relations. Since our focus is on developing distributed Bayesian
hypothesis tests for FDD using BP/BO, we only consider Markov network
models. Many tutorial monographs on graphical models are available
(see [4,7,21], for example).

2.1 Pairwise MRF 

Markov networks (also known as Markov random field (MRF) models) are suited
for representing conditional dependencies of the node states.

Definition 1 (MRF). The random vector $X$ is Markov with respect to the graph
$G = (V, E)$ if for any partition of the node set $V$ into disjoint sets $A$, $B$, $C$,
in which $B$ separates $A$ and $C$, the degenerate random vectors $X_A$, $X_B$, $X_C$
corresponding to each node set are conditionally independent in the sense that
$P_{AC|B}(x_A, x_C \mid x_B) = P_{A|B}(x_A \mid x_B)\, P_{C|B}(x_C \mid x_B)$, or equivalently
$P_{A|BC}(x_A \mid x_B, x_C) = P_{A|B}(x_A \mid x_B)$ (or symmetrically,
$P_{C|AB}(x_C \mid x_A, x_B) = P_{C|B}(x_C \mid x_B)$).

The next theorem, called the Hammersley-Clifford theorem, provides a
sufficient (and necessary) condition under which the joint probability
distribution of the node states can be represented as an MRF.

Theorem 1 (Hammersley-Clifford Theorem (see [4,18])). The random vector
$X$ is Markov w.r.t. the graph $G$ if (and only if, for strictly positive probability
distributions) its distribution can be factorized as a product of functions of the
variables restricted to cliques, i.e., the joint probability can be factorized as follows:

$$P(x) = \gamma \prod_{c \in \mathcal{C}} \psi_c(x_c) \tag{1}$$

where $\gamma = \big(\sum_{x} \prod_{c \in \mathcal{C}} \psi_c(x_c)\big)^{-1}$ and $\mathcal{C}$ refers to the set of cliques in $G$.

The $\psi_c(x_c)$ are called the compatibility functions that correspond to the
marginal probabilities, and their negative logarithms are referred to as potentials
or potential functions, $V_c(x_c) := -\ln \psi_c(x_c) \geq 0$. The factorization (1) can be
rewritten as

$$P(x) = \gamma \Big(\prod_{k \in V} \psi_k(x_k)\Big) \Big(\prod_{(i,j) \in E} \psi_{ij}(x_i, x_j)\Big) \Big(\prod_{c \in \mathcal{C} \setminus (V \cup E)} \psi_c(x_c)\Big) \tag{2}$$

Assumption 1 (Pairwise Potentials). We assume that either 

i. there is no clique with more than two nodes in the graph G, or 

ii. the potentials are only defined by the variable as a single node in V or by 
the two variables as a pair of nodes on an edge in E. 

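Under Assumption 1, there is no contribution of the last term in (2), i.e.,

$$P(x) = \tilde{P}(x) = \gamma \Big(\prod_{k \in V} \psi_k(x_k)\Big) \Big(\prod_{(i,j) \in E} \psi_{ij}(x_i, x_j)\Big) \tag{3}$$

where $\gamma$ is the normalization constant of (1) and $\tilde{P}(x)$ can be interpreted
as an approximation of the joint probability distribution $P(x)$ of the random
variable $X$ that is Markov w.r.t. $G = (V, E)$, up to the 2-cliques.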


2.1.1 Graphical models for distributed statistical inference 

From here on, we assume that there are local measurements (or evidence) $y_k \in \mathcal{Y}_k$
that are associated with the node $k \in V$. For any non-loopy graph, i.e., graphical
models on trees, the compatibility functions can be represented in terms of
the marginal probabilities up to the 2-cliques: $\psi_k(x_k) = p_k(x_k)\, p(y_k \mid x_k)$ for
$k \in V$ and
$\psi_{ij}(x_i, x_j) = p_{ij}(x_i, x_j)\, p(y_i, y_j \mid x_i, x_j) / \big(p_i(x_i)\, p(y_i \mid x_i)\, p_j(x_j)\, p(y_j \mid x_j)\big)$
for $(i,j) \in E$. With this representation of the compatibility functions, $P(x)$
can be rewritten as

$$P(x) = \gamma \Big(\prod_{k \in V} p_k(x_k)\, p(y_k \mid x_k)\Big) \Big(\prod_{(i,j) \in E} \frac{p_{ij}(x_i, x_j)\, p(y_i, y_j \mid x_i, x_j)}{p_i(x_i)\, p(y_i \mid x_i)\, p_j(x_j)\, p(y_j \mid x_j)}\Big) \tag{4}$$

or

$$P(x) = \gamma \Big(\prod_{k \in V} p(x_k \mid y_k)\Big) \Big(\prod_{(i,j) \in E} \frac{p(x_i, x_j \mid y_i, y_j)}{p(x_i \mid y_i)\, p(x_j \mid y_j)}\Big) \tag{5}$$

where, with a slight abuse of notation, $\gamma$ might not be the same as the one in (3), but can
be considered as an equivalent partition function (value).
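
The step from (4) to (5) is an application of Bayes' rule. A short derivation is sketched below; the evidence terms $p(y_k)$ and $p(y_i, y_j)$ are not named in the text and are introduced here only for illustration.

```latex
\begin{align*}
p_k(x_k)\, p(y_k \mid x_k) &= p(x_k \mid y_k)\, p(y_k), \\
p_{ij}(x_i, x_j)\, p(y_i, y_j \mid x_i, x_j) &= p(x_i, x_j \mid y_i, y_j)\, p(y_i, y_j), \\
\frac{p_{ij}(x_i, x_j)\, p(y_i, y_j \mid x_i, x_j)}
     {p_i(x_i)\, p(y_i \mid x_i)\, p_j(x_j)\, p(y_j \mid x_j)}
  &= \frac{p(y_i, y_j)}{p(y_i)\, p(y_j)} \cdot
     \frac{p(x_i, x_j \mid y_i, y_j)}{p(x_i \mid y_i)\, p(x_j \mid y_j)}.
\end{align*}
```

The factors that depend only on the measurements are constant in $x$, so they can be absorbed into the normalization constant, which is why the constant in (5) need not equal the one in (3).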

For the purpose of distributed statistical inference in a graphical model, 
a goal is to estimate the posterior marginal probabilities, for which messages 
from the neighboring nodes are required to have sufficient statistics of local 
measurements that can be considered as realizations from unknown probability 
distributions. 


Problem 1. Consider an undirected graph $G = (V, E)$. Compute (or approximate)
the posterior marginal probabilities

$$p_k(x_k \mid y_1, \ldots, y_N), \quad k \in V \tag{6}$$

where $N = |V|$.

To solve Problem 1 exactly, the required property of a BP method is the
relation of sufficient statistics

$$p_k(x_k \mid y_k, m_k) = p_k(x_k \mid y_1, \ldots, y_N), \quad k \in V \tag{7}$$

where $m_k$ refers to the total messages delivered to Agent $k$ at the node $k$.*

2.1.2 Distributed belief propagation 

Algorithm 1 (Belief propagation algorithm). In a belief propagation algorithm,
the belief at the node $k$ in its state $x_k$ is

$$\beta_k(x_k) \propto \psi_k(x_k) \prod_{\ell \in \mathcal{N}(k)} \mu_{\ell \to k}(x_k) \tag{8}$$

and the message from the node $\ell$ to the node $k$ about the state $x_k$ can be either
the sum-product BP message

$$\mu_{\ell \to k}(x_k) \propto \sum_{x_\ell} \psi_{\ell k}(x_\ell, x_k)\, \psi_\ell(x_\ell) \prod_{u \in \mathcal{N}(\ell) \setminus \{k\}} \mu_{u \to \ell}(x_\ell) \tag{9}$$

or the max-product BP message

$$\mu_{\ell \to k}(x_k) \propto \max_{x_\ell} \psi_{\ell k}(x_\ell, x_k)\, \psi_\ell(x_\ell) \prod_{u \in \mathcal{N}(\ell) \setminus \{k\}} \mu_{u \to \ell}(x_\ell), \tag{10}$$

where $\mathcal{N}(k)$ denotes the set of nodes neighboring the node $k$ and the conditional
dependence of the beliefs and messages on the measurements $Y = \{y_i\}_{i=1}^{N}$ is
dropped for the sake of notation.

* Agent k refers to a processor or decision maker at the node k.

In the aforementioned BP algorithms, there are slightly different methods of
computing the messages to be transmitted. They serve different purposes [25]: (a)
the max-product BP message is used to obtain a global state that is most probable in
the Bayesian sense and consists of local states maximizing the local beliefs, and
(b) the sum-product BP message is used to compute marginal posterior probabilities,
given the total evidence or measurements that are available in the system. Their
properties need to be clarified.

The Max-Product BP A goal of a belief propagation algorithm for Bayesian
estimation, particularly for maximum a posteriori estimation, can be to achieve
the relation

$$\beta_k(x_k) = \alpha_k \max_{x_{-k}} p_k(x_k, x_{-k} \mid y_1, \ldots, y_N), \quad \forall x_k,\ \forall k \in V, \tag{11}$$

for given total measurement data $\{y_k\} \in Y$, where each $\alpha_k$ is a positive constant
that is independent of the value of $x_k$ and results in $\beta_k(\cdot) \in [0,1]$. Alternatively,
a slightly weaker relation is that for given measurement data $\{y_k\}$,

$$\beta_k(x) < \beta_k(z) \;\Rightarrow\; \max_{x_{-k}} p_k(x, x_{-k} \mid y_1, \ldots, y_N) < \max_{x_{-k}} p_k(z, x_{-k} \mid y_1, \ldots, y_N), \tag{12}$$

for all nodes $k \in V$. Note that the above relation can ensure the marginal
maximum a posteriori (m-MAP) estimation, i.e.,

$$x_k^* = \arg\max_{x} \beta_k(x) = \arg\max_{x} \max_{x_{-k}} p_k(x, x_{-k} \mid y_1, \ldots, y_N) \tag{13}$$

and they indeed result in the joint MAP (j-MAP) estimator satisfying the relation

$$\{x_i^*\} = \arg\max_{\{x_i\}} p(x_1, \ldots, x_N \mid y_1, \ldots, y_N). \tag{14}$$

The Sum-Product BP Similar to the max-product BP algorithm, the goal
of the sum-product BP is to achieve the relation

$$\beta_k(x_k) = \alpha_k \sum_{x_{-k}} p_k(x_k, x_{-k} \mid y_1, \ldots, y_N), \quad \forall k \in V, \tag{15}$$

where the summation is computed over all realizations of the compound random
vector $X_{-k}$ and each $\alpha_k$ is a positive constant that is independent of the value of
$x_k$ and results in $\beta_k(\cdot) \in [0,1]$. Note that this amounts to estimating the marginal
posterior probabilities, for given total measurements.

Remark 1. A notable difference between the sum-product BP and the max-product
BP is that the combination of the optimal m-MAP estimators $x_k^* = \arg\max_x \beta_k(x)$,
where the beliefs are obtained from the sum-product BP, does not necessarily
constitute an optimal j-MAP estimate.
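
Remark 1 can be checked on a toy case. The following is a minimal sketch, assuming a hypothetical joint posterior over two binary states given as unnormalized integer weights (the numbers are illustrative and not taken from the paper). It compares the estimate built from sum-product targets (15) with the one built from max-product targets (11).

```python
# Hypothetical joint posterior over (x1, x2), as unnormalized integer weights
# so that all comparisons below are exact.
joint = {(0, 0): 4, (0, 1): 0, (1, 0): 3, (1, 1): 3}


def marginal(joint, k):
    """Sum-product target: sum the joint over all states except node k."""
    out = {}
    for x, w in joint.items():
        out[x[k]] = out.get(x[k], 0) + w
    return out


def max_marginal(joint, k):
    """Max-product target: maximize the joint over all states except node k."""
    out = {}
    for x, w in joint.items():
        out[x[k]] = max(out.get(x[k], 0), w)
    return out


def argmax(table):
    """Return the key with the largest value."""
    return max(table, key=table.get)


joint_map = argmax(joint)                                        # j-MAP estimate
from_sum = tuple(argmax(marginal(joint, k)) for k in range(2))   # per-node argmax of marginals
from_max = tuple(argmax(max_marginal(joint, k)) for k in range(2))

print("joint MAP:", joint_map)
print("argmax of sum-product beliefs:", from_sum)
print("argmax of max-product beliefs:", from_max)
```

In this example the node-wise maximizers of the marginals combine into a configuration that differs from the joint MAP, whereas the node-wise maximizers of the max-marginals recover it.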
Iterative Message-Passing and Fixed-Points The following algorithm is
a standard parallel iterative message-passing algorithm for belief
propagation.

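Algorithm 2 (Parallel iterative message-passing algorithm). The belief at the
node $k$ in its state $x_k$ at time $t$ is

$$\beta_k^{(t)}(x_k) \propto \psi_k(x_k) \prod_{\ell \in \mathcal{N}(k)} \mu_{\ell \to k}^{(t)}(x_k) \tag{16}$$

and the message from the node $\ell$ to the node $k$ about the state $x_k$ at time $t$ can
be either the sum-product BP message update

$$\mu_{\ell \to k}^{(t)}(x_k) \propto \sum_{x_\ell} \psi_{\ell k}(x_\ell, x_k)\, \psi_\ell(x_\ell) \prod_{u \in \mathcal{N}(\ell) \setminus \{k\}} \mu_{u \to \ell}^{(t-1)}(x_\ell) \tag{17}$$

or the max-product BP message update

$$\mu_{\ell \to k}^{(t)}(x_k) \propto \max_{x_\ell} \psi_{\ell k}(x_\ell, x_k)\, \psi_\ell(x_\ell) \prod_{u \in \mathcal{N}(\ell) \setminus \{k\}} \mu_{u \to \ell}^{(t-1)}(x_\ell). \tag{18}$$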
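
The parallel sum-product update of Algorithm 2 can be sketched in a few lines. The example below is a minimal illustration, assuming a hypothetical 3-node chain with binary states and made-up compatibility tables (`psi_node`, `psi_edge`); it runs the update for a number of rounds equal to the diameter of the chain and compares the beliefs with marginals obtained by brute-force enumeration.

```python
from itertools import product

# Hypothetical chain 0 - 1 - 2 with binary node states (illustrative numbers).
n_states = 2
psi_node = {0: [0.7, 0.3], 1: [0.5, 0.5], 2: [0.2, 0.8]}
psi_edge = {(0, 1): [[0.9, 0.1], [0.1, 0.9]],
            (1, 2): [[0.6, 0.4], [0.4, 0.6]]}
neighbors = {0: [1], 1: [0, 2], 2: [1]}


def pair(l, k, xl, xk):
    """psi_{lk}(x_l, x_k), looked up regardless of how the edge is stored."""
    if (l, k) in psi_edge:
        return psi_edge[(l, k)][xl][xk]
    return psi_edge[(k, l)][xk][xl]


def normalize(v):
    s = sum(v)
    return [a / s for a in v]


def parallel_bp(num_iters):
    # mu[(l, k)][x_k] is the message from node l to node k; start uniform.
    mu = {(l, k): [1.0 / n_states] * n_states
          for l in neighbors for k in neighbors[l]}
    for _ in range(num_iters):
        new = {}
        for (l, k) in mu:  # every message is recomputed from the previous round
            msg = []
            for xk in range(n_states):
                total = 0.0
                for xl in range(n_states):
                    term = pair(l, k, xl, xk) * psi_node[l][xl]
                    for u in neighbors[l]:
                        if u != k:
                            term *= mu[(u, l)][xl]
                    total += term
                msg.append(total)
            new[(l, k)] = normalize(msg)
        mu = new
    beliefs = {}
    for k in neighbors:  # belief = local compatibility times incoming messages
        b = list(psi_node[k])
        for l in neighbors[k]:
            b = [b[x] * mu[(l, k)][x] for x in range(n_states)]
        beliefs[k] = normalize(b)
    return beliefs


def brute_force_marginals():
    marg = {k: [0.0] * n_states for k in neighbors}
    for x in product(range(n_states), repeat=len(neighbors)):
        w = 1.0
        for k in neighbors:
            w *= psi_node[k][x[k]]
        for (i, j) in psi_edge:
            w *= psi_edge[(i, j)][x[i]][x[j]]
        for k in neighbors:
            marg[k][x[k]] += w
    return {k: normalize(v) for k, v in marg.items()}


beliefs = parallel_bp(num_iters=2)  # 2 is the diameter of this chain
exact = brute_force_marginals()
```

On a tree, the beliefs agree with the enumerated marginals once messages have had time to travel across the graph; on a graph with cycles the same loop can still be run, but no such agreement should be expected in general.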

3 Belief Optimization in Graphical Models 

3.1 Bethe-Peierls Approximation to the Free Energy

In [23-25], the authors showed that the fixed points of BP and its
generalization are indeed associated with extrema of the Bethe and Kikuchi free
energies, respectively. Here, we provide a concise overview of some useful results
from statistical physics. In particular, the observation that statistical inference
problems can be represented as minimization of an (approximate) free energy (see
also [23,25]) motivates the study of various approximate free energies.

3.1.1 Gibbs free energy in statistical physics 

In statistical physics, the Boltzmann distribution law tells us that for the energy
$E(x)$ associated with some state or condition $x$ of a system, the probability
distribution of its occurrence is given by

$$p(x) = \frac{1}{Z} \exp(-E(x)/T) \tag{19}$$

where $Z$ denotes the partition function (constant) and $T$ is the temperature that
can be set to 1 without loss of generality. Comparing this to the factorization
(1) gives $\gamma = 1/Z$ and $E(x) = -\sum_{c \in \mathcal{C}} \ln \psi_c(x_c) = \sum_{c \in \mathcal{C}} V_c(x_c)$, i.e., the
total energy is the sum of potentials over the system. To compute the distance
between the belief $\beta(x)$ and the true joint probability distribution, use the
Kullback-Leibler (KL) distance that is defined by

$$D(\beta \,\|\, p) = \sum_{x} \beta(x) \ln \frac{\beta(x)}{p(x)} = \sum_{x} \beta(x) E(x) + \sum_{x} \beta(x) \ln \beta(x) + \ln Z \tag{20}$$

such that $D(\beta \,\|\, p) = 0$ if and only if $\beta = p$ and $D(\beta \,\|\, p) \geq 0$ for all $\beta \in \Delta$, where
$\Delta$ refers to the set of probability distributions. Define the Gibbs free energy by

$$G(\beta) = \sum_{x} \beta(x) E(x) + \sum_{x} \beta(x) \ln \beta(x) = U(\beta) - H(\beta) \tag{21}$$

such that $D(\beta \,\|\, p) = G(\beta) - F$, where $F = -\ln Z$ is called the Helmholtz free
energy, and $U(\beta)$ and $H(\beta)$ refer to the average energy and the entropy,
respectively.
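
The identity relating the KL distance, the Gibbs free energy, and the Helmholtz free energy can be verified numerically. The sketch below assumes a hypothetical energy table over two binary states and $T = 1$; the energies and the trial belief `beta` are arbitrary illustrative values.

```python
import math
from itertools import product

# Hypothetical energies E(x) for x = (x1, x2) with binary entries, T = 1.
states = list(product(range(2), repeat=2))
energy = dict(zip(states, [0.3, 1.2, 0.7, 2.0]))

Z = sum(math.exp(-e) for e in energy.values())           # partition function
p = {x: math.exp(-e) / Z for x, e in energy.items()}     # Boltzmann distribution
F = -math.log(Z)                                          # Helmholtz free energy


def gibbs_free_energy(beta):
    """G(beta) = U(beta) - H(beta): average energy minus entropy."""
    U = sum(b * energy[x] for x, b in beta.items())
    H = -sum(b * math.log(b) for b in beta.values() if b > 0)
    return U - H


def kl(beta, p):
    """Kullback-Leibler distance D(beta || p)."""
    return sum(b * math.log(b / p[x]) for x, b in beta.items() if b > 0)


beta = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.2}  # arbitrary trial belief
gap = gibbs_free_energy(beta) - F   # should coincide with D(beta || p)

print("D(beta||p) =", kl(beta, p), " G(beta) - F =", gap)
```

Because the gap equals a KL distance, it is nonnegative and vanishes only at the Boltzmann distribution itself, which is what makes free energy minimization an inference principle.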


3.1.2 Approximate free energy 

Previously, we assumed that the joint probability $p(x)$ is a function of the total
energy function $E(x)$. Suppose that the system is a pairwise MRF with the
graph $G = (V, E)$ in which there is no potential related to cliques with more than
two nodes. Then the corresponding energy of such a configuration is

$$E(x) = -\sum_{k \in V} \ln \psi_k(x_k) - \sum_{(i,j) \in E} \ln \psi_{ij}(x_i, x_j). \tag{22}$$

A. The Mean Field Free Energy In the mean-field theory, the joint distribution
$\beta(x)$ is approximated by the complete factorization, i.e.,

$$\beta(x) \approx \prod_{k \in V} \beta_k(x_k). \tag{23}$$

With this approximate joint distribution under a pairwise MRF configuration,
the mean-field average energy is

$$U(\{\beta_k\}_{k \in V}) = -\sum_{k \in V} \sum_{x_k} \beta_k(x_k) \ln \psi_k(x_k) - \sum_{(i,j) \in E} \sum_{x_i, x_j} \beta_i(x_i)\, \beta_j(x_j) \ln \psi_{ij}(x_i, x_j) \tag{24}$$

and similarly the mean-field entropy is

$$H(\{\beta_k\}_{k \in V}) = -\sum_{k \in V} \sum_{x_k} \beta_k(x_k) \ln \beta_k(x_k). \tag{25}$$

Note that the mean field free energy $G = U - H$ is a function of the separate
one-node beliefs $\beta_k(\cdot)$.
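
To make the mean field free energy concrete, the following sketch evaluates the average energy and entropy for a hypothetical two-node pairwise MRF (the tables `psi_node`, `psi_edge` and the trial beliefs are illustrative assumptions) and compares the result with the exact Helmholtz free energy obtained by enumeration.

```python
import math

# Hypothetical pairwise MRF with two binary nodes joined by one edge.
psi_node = {0: [0.7, 0.3], 1: [0.4, 0.6]}
psi_edge = {(0, 1): [[0.9, 0.2], [0.2, 0.9]]}


def mean_field_free_energy(beta):
    """G = U - H for fully factorized beliefs beta[k][x_k]."""
    U = 0.0
    for k, b in beta.items():                      # one-node energy terms
        U -= sum(v * math.log(psi_node[k][x]) for x, v in enumerate(b))
    for (i, j), psi in psi_edge.items():           # pairwise energy terms
        U -= sum(beta[i][xi] * beta[j][xj] * math.log(psi[xi][xj])
                 for xi in range(len(beta[i]))
                 for xj in range(len(beta[j])))
    H = -sum(v * math.log(v) for b in beta.values() for v in b if v > 0)
    return U - H


# Exact Helmholtz free energy F = -ln Z by enumerating all four states.
Z = sum(psi_node[0][a] * psi_node[1][b] * psi_edge[(0, 1)][a][b]
        for a in range(2) for b in range(2))
F = -math.log(Z)

beta = {0: [0.6, 0.4], 1: [0.5, 0.5]}   # arbitrary product-form trial beliefs
G_mf = mean_field_free_energy(beta)

print("mean field G =", G_mf, " exact F =", F)
```

Since a product of one-node beliefs is itself a valid joint distribution, the value `G_mf` can never fall below `F`, whatever trial beliefs are chosen.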
B. The Bethe Free Energy For a more general approximation, the joint
distribution $\beta(x)$ can be approximated by a factorization with one- and two-node
beliefs, viz.,


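$$\beta(x) \approx \frac{\prod_{(i,j) \in E} \beta_{ij}(x_i, x_j)}{\prod_{k \in V} \beta_k(x_k)^{q_k - 1}} \tag{26}$$

where $q_k = |\mathcal{N}(k)|$. With this approximate joint distribution under a pairwise
MRF configuration, the Bethe average energy is

$$U(\{\beta_k\}_{k \in V}, \{\beta_{ij}\}_{(i,j) \in E}) = -\sum_{k \in V} \sum_{x_k} \beta_k(x_k) \ln \psi_k(x_k) - \sum_{(i,j) \in E} \sum_{x_i, x_j} \beta_{ij}(x_i, x_j) \ln \psi_{ij}(x_i, x_j) \tag{27}$$

and similarly the Bethe entropy is

$$H(\{\beta_k\}_{k \in V}, \{\beta_{ij}\}_{(i,j) \in E}) = \sum_{k \in V} (q_k - 1) \sum_{x_k} \beta_k(x_k) \ln \beta_k(x_k) - \sum_{(i,j) \in E} \sum_{x_i, x_j} \beta_{ij}(x_i, x_j) \ln \beta_{ij}(x_i, x_j). \tag{28}$$

Remark 2. In contrast to the mean field free energy, the Bethe free energy is not
generally an upper bound on the true (Helmholtz) free energy $F = -\ln Z$ [25].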
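
A useful sanity check on the Bethe free energy is that, on a tree, evaluating it at the exact one- and two-node marginals reproduces $-\ln Z$. The sketch below assumes a hypothetical 3-node chain with illustrative compatibility tables and obtains the exact marginals by enumeration.

```python
import math
from itertools import product

# Hypothetical chain 0 - 1 - 2 with binary states (illustrative numbers).
psi_node = {0: [0.7, 0.3], 1: [0.5, 0.5], 2: [0.2, 0.8]}
psi_edge = {(0, 1): [[0.9, 0.1], [0.1, 0.9]],
            (1, 2): [[0.6, 0.4], [0.4, 0.6]]}
nodes = sorted(psi_node)
degree = {k: sum(k in e for e in psi_edge) for k in nodes}   # q_k = |N(k)|

# Exact joint weights and partition function by enumeration.
weights = {}
for x in product(range(2), repeat=len(nodes)):
    w = 1.0
    for k in nodes:
        w *= psi_node[k][x[k]]
    for (i, j), psi in psi_edge.items():
        w *= psi[x[i]][x[j]]
    weights[x] = w
Z = sum(weights.values())

# Exact one-node and two-node marginals.
beta_node = {k: [sum(w for x, w in weights.items() if x[k] == a) / Z
                 for a in range(2)] for k in nodes}
beta_edge = {(i, j): [[sum(w for x, w in weights.items()
                           if x[i] == a and x[j] == b) / Z
                       for b in range(2)] for a in range(2)]
             for (i, j) in psi_edge}


def bethe_free_energy(beta_node, beta_edge):
    """G = U - H with the Bethe average energy and Bethe entropy."""
    U, H = 0.0, 0.0
    for k, b in beta_node.items():
        for a, v in enumerate(b):
            U -= v * math.log(psi_node[k][a])
            H += (degree[k] - 1) * v * math.log(v)
    for e, B in beta_edge.items():
        for a in range(2):
            for b in range(2):
                U -= B[a][b] * math.log(psi_edge[e][a][b])
                H -= B[a][b] * math.log(B[a][b])
    return U - H


G_bethe = bethe_free_energy(beta_node, beta_edge)
print("Bethe G at exact marginals =", G_bethe, " -ln Z =", -math.log(Z))
```

The agreement relies on the graph being cycle-free; on loopy graphs the Bethe value is only an approximation and, as Remark 2 notes, need not be a bound.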


3.2 Belief Optimization 

Consider the discrete random variables $X_k \in \mathcal{X}_k = \{x_{k1}, x_{k2}, \ldots, x_{kn_k}\}$ with
probability one and $|\mathcal{X}_k| = n_k$ for each $k \in V$. For the sake of notation, assume
that all the nodes have the same cardinality of their supports, i.e., $n_k = n$ for
all $k \in V$. Define the probability vector and matrix by

$$\beta_k = \begin{bmatrix} \beta_k(x_{k1}) \\ \vdots \\ \beta_k(x_{kn}) \end{bmatrix}, \quad \text{for } k \in V \tag{29}$$

and

$$\beta_{ij} = \begin{bmatrix} \beta_{ij}(x_{i1}, x_{j1}) & \cdots & \beta_{ij}(x_{i1}, x_{jn}) \\ \vdots & \ddots & \vdots \\ \beta_{ij}(x_{in}, x_{j1}) & \cdots & \beta_{ij}(x_{in}, x_{jn}) \end{bmatrix}, \quad \text{for } (i,j) \in E, \tag{30}$$

respectively. Belief optimization (BO) is to find $\{\beta_k\}_{k \in V}$ minimizing $G$ for
the mean field free energy approximation, or $(\{\beta_k\}_{k \in V}, \{\beta_{ij}\}_{(i,j) \in E})$ minimizing
$G$ for the Bethe free energy approximation.

3.2.1 Minimization of the mean field free energy

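A popular method of approximating a free energy is the aforementioned mean
field approach, in which an optimal configuration of beliefs, that is, an
approximation of the joint probability distribution, is obtained as the factorization (23)
and the associated factors $\{\beta_k\}_{k \in V}$ are optimal solutions of the constrained
minimization

$$\begin{aligned} \min \quad & G = U - H \\ \text{s.t.} \quad & e'\beta_k = 1, \quad k \in V \\ & 0 \leq \beta_k \leq 1, \quad k \in V \end{aligned} \tag{31}$$

where $U$ and $H$ are given by (24) and (25), respectively, $e$ denotes the vector
of all ones, and the inequalities hold entrywise. It can be explicitly rewritten as

$$\begin{aligned} \min \quad & -\sum_{k \in V} \beta_k' \ln \psi_k - \sum_{(i,j) \in E} \beta_i' \ln \psi_{ij}\, \beta_j + \sum_{k \in V} \beta_k' \ln \beta_k \\ \text{s.t.} \quad & \beta_k \in \Delta, \quad k \in V \end{aligned} \tag{32}$$

where $\Delta = \{p \in \mathbb{R}^n : e'p = 1,\ p_i \in [0,1],\ \forall i\}$ and logarithms of vectors and
matrices are taken entrywise.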


3.2.2 Minimization of the Bethe free energy 

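Similar to minimization of the mean field free energy, an optimal configuration
of beliefs, that is, an approximation of the joint probability distribution, can be
obtained as the factorization (26), and the associated factors $(\{\beta_k\}_{k \in V}, \{\beta_{ij}\}_{(i,j) \in E})$
are optimal solutions of the constrained minimization

$$\begin{aligned} \min \quad & G = U - H \\ \text{s.t.} \quad & e'\beta_k = 1, \quad k \in V \\ & 0 \leq \beta_k \leq 1, \quad k \in V \\ & e'\beta_{ij} = \beta_j', \quad \beta_{ij} e = \beta_i, \quad (i,j) \in E \end{aligned} \tag{33}$$

where $U$ and $H$ are given by (27) and (28), respectively. It can be explicitly
rewritten as

$$\begin{aligned} \min \quad & -\sum_{k \in V} \beta_k' \ln \psi_k - \sum_{(i,j) \in E} [\beta_{ij} \circ \ln \psi_{ij}] + \sum_{k \in V} (1 - q_k)\, \beta_k' \ln \beta_k + \sum_{(i,j) \in E} [\beta_{ij} \circ \ln \beta_{ij}] \\ \text{s.t.} \quad & \beta_k \in \Delta, \quad k \in V \\ & e'\beta_{ij} = \beta_j', \quad \beta_{ij} e = \beta_i, \quad (i,j) \in E \end{aligned} \tag{34}$$

where $[A \circ B] = \mathrm{Tr}(A'B)$ refers to the entrywise sum of the Hadamard (aka Schur)
product $A \circ B$.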
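
The bracket notation can be illustrated with a small check that the entrywise sum of the Hadamard product coincides with the trace of $A'B$. The matrices below are arbitrary integer examples chosen only for illustration.

```python
# Arbitrary 2 x 2 integer matrices, stored as lists of rows.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]


def hadamard_sum(A, B):
    """Sum of all entries of the entrywise (Hadamard) product of A and B."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


def trace_of_transpose_product(A, B):
    """Tr(A'B), computed by forming A'B explicitly."""
    rows, cols = len(A), len(A[0])
    AtB = [[sum(A[i][j] * B[i][k] for i in range(rows)) for k in range(cols)]
           for j in range(cols)]
    return sum(AtB[j][j] for j in range(cols))


print(hadamard_sum(A, B), trace_of_transpose_product(A, B))  # both print 70
```

In the Bethe objective this means each pairwise term is a plain sum over all state pairs of a belief entry times the logarithm of the matching compatibility or belief entry.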
3.2.3 Minimization of the TAP free energy

The TAP (Thouless-Anderson-Palmer) approach, which is used to approximate
the free energy in statistical mechanics, has been adopted for robust decoding and
statistical inference based on belief propagation (see [8,11,12], for example). An
optimal configuration of beliefs, that is, an approximation of the joint probability
distribution, can be obtained as the factorization (23), and the associated factors
$\{\beta_k\}_{k \in V}$ are optimal solutions of the constrained minimization

$$\begin{aligned} \min \quad & G = U - H - T \\ \text{s.t.} \quad & \beta_k \in \Delta, \quad k \in V \end{aligned} \tag{35}$$

where $T$ refers to the TAP correction to the mean field free energy. This belief
optimization based on the TAP free energy approximation is similar to the mean
field free energy approach, in which the marginal probability distributions are
assumed to be independent. In addition, the TAP free energy approach can
be considered as an approximation of the Bethe free energy approach up to
the second-order moment [22]. Due to its similarity to the mean field free energy
approach and its lack of accuracy compared to the Bethe free energy approach, we
focus only on the mean field and the Bethe free energy approaches and on
solving the corresponding constrained minimization problems.


4 BP/BO Approaches to Belief Consensus 

This section develops decomposed methods to solve the optimizations presented
in Section 3.2. In particular, methods of dual decomposition (see Appendix
A.1) that solve the associated large-scale optimization are used for
decentralized/distributed computations.


4.1 Belief Consensus: Dual Decomposition Approaches 

4.1.1 Minimization of the mean field free energy 

Consider the constrained minimization (32). This large-scale optimization over
a graphical model can be decomposed into separate constrained minimizations
in which Agent $i$ solves the optimization

$$\begin{aligned} \min_{\beta_i,\, \{\beta_j\}_{j \in \mathcal{N}(i)}} \quad & -\beta_i' \ln \psi_i - \sum_{j \in \mathcal{N}(i)} \beta_i' \ln \psi_{ij}\, \beta_j + \beta_i' \ln \beta_i \\ \text{s.t.} \quad & \beta_i \in \Delta, \\ & \beta_j = \beta_i, \quad \forall j \in \mathcal{N}(i), \end{aligned} \tag{36}$$

where the second constraint corresponds to the consensus between the agents
on edges connecting the node of Agent $i$ and $\mathcal{N}(i)$ refers to the set of Agents
neighboring Agent $i$.

For fault detection and diagnosis with multiple hypotheses, assume that each
agent has the same bank of hypothesized models and that the objective of the resultant
distributed decision making is to obtain optimal marginal beliefs $\{\beta_i\}_{i \in V}$ that
achieve consistency in the localized estimations, i.e.,

$$\text{Marginal Belief Consensus I:} \quad \beta_i(x) = \beta(x), \quad \forall x \in \mathcal{X}_i,\ \forall i \in V \tag{37}$$

which can be rewritten as

$$\beta_i = \beta, \quad \forall i \in V, \ \text{for some } \beta \in \Delta. \tag{38}$$

Incorporating the consensus requirement (38) into (36) results in a decomposed
optimization in which Agent $i$ solves

$$\begin{aligned} \min_{\beta_i} \quad & -\beta_i' \ln \psi_i + \beta_i' M_i \beta_i + \beta_i' \ln \beta_i \\ \text{s.t.} \quad & \beta_i = \beta \in \Delta, \end{aligned} \tag{39}$$

where $M_i = -\sum_{j \in \mathcal{N}(i)} \ln \psi_{ij}$ are nonnegative matrices, since the entries of $\psi_{ij}$
correspond to compatibility functions or constraints and can be normalized to be
in the interval [0, 1] without deforming the configuration of the free energy with
respect to the beliefs. Notice that the pseudo variable $\beta$ is a global variable
that is required to be the same in all decomposed (slave) problems.

Case 1: [For $M_i \succeq 0$] If the pairwise compatibility matrix $M_i$ is positive
semidefinite, then the optimization (39) is convex and can be solved by using
iterative dual decomposition methods, in which computations are decentralized
for each Agent $i$ and belief consensus is achieved by iterations that find optimal
Lagrange multipliers. For details of the use of dual decomposition methods and
the underlying theory, see Appendix A.1.

Case 2: [For $M_i \succeq_\Delta 0$] If the pairwise compatibility matrix $M_i$ is
conditionally positive semidefinite over the standard simplex $\Delta$, then the optimization
(39) is convex. However, checking whether $M_i \succeq_\Delta 0$ is indeed NP-hard [15]. If prior
knowledge of $M_i \succeq_\Delta 0$ is available, then one can use the same dual
decomposition methods as in Case 1. If the condition $M_i \succeq_\Delta 0$ is not known a priori, then
one might use the semidefinite programming relaxation described in the
subsequent Case 3.

Case 3: [Indefinite $M_i$] The optimization (39) can be rewritten as

$$\begin{aligned} \min \quad & -\beta_i' \ln \psi_i + \langle M_i, B_i \rangle + \beta_i' \ln \beta_i \\ \text{s.t.} \quad & \beta_i = \beta \in \Delta, \\ & \beta_i \beta_i' = B_i, \end{aligned} \tag{40}$$

where $\langle X, Y \rangle = \mathrm{Tr}(X'Y)$. Since for any $\beta_i \in \Delta$

$$\beta_i \beta_i' = B_i \;\Longleftrightarrow\; B_i e = \beta_i,\ B_i \succeq 0,\ \mathrm{rank}(B_i) = 1,\ e'B_i e = 1, \tag{41}$$

a convex relaxation of (40) can be

$$\begin{aligned} \min \quad & -\beta_i' \ln \psi_i + \langle M_i, B_i \rangle + \beta_i' \ln \beta_i \\ \text{s.t.} \quad & \beta_i = \beta \in \Delta, \\ & B_i e = \beta_i, \quad e'B_i e = 1, \quad B_i = B \succeq 0, \end{aligned} \tag{42}$$

where the rank constraint is not imposed and $B$ is a pseudo variable that all
Agents share, i.e., it is a global variable that is required to be the same in
all the decomposed optimizations. The optimization (42) provides a suboptimal
solution for (40) and the corresponding suboptimal value is a lower bound on
the optimal value of (40). The resultant optimization (42) is convex and can be
efficiently solved to find suboptimal solutions $\beta_i^* = \beta$ for all Agents $i \in V$. In
particular, we suggest using dual decomposition methods (see Appendix A.1).

4.1.2 Minimization of the Bethe free energy 


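Consider the constrained minimization (34). This large-scale optimization over
a graphical model can be decomposed into separate constrained minimizations
in which Agent $i$ solves the optimization

$$\begin{aligned} \min \quad & -\beta_i' \ln \psi_i - \sum_{j \in \mathcal{N}(i)} [\beta_{ij} \circ \ln \psi_{ij}] + (1 - q_i)\, \beta_i' \ln \beta_i + \sum_{j \in \mathcal{N}(i)} [\beta_{ij} \circ \ln \beta_{ij}] \\ \text{s.t.} \quad & \beta_i \in \Delta, \\ & \beta_{ij} e = \beta_i, \quad j \in \mathcal{N}(i) \end{aligned} \tag{43}$$

where the second constraint corresponds to the marginal probability constraint
for the agents on edges connecting the node of Agent $i$.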

Similar to the mean field energy approach, for fault detection and diagnosis
with multiple hypotheses, assume that each agent has the same bank of
hypothesized models and that the objective of the resultant distributed decision making is to
obtain optimal marginal and pairwise marginal beliefs $(\{\beta_k\}_{k \in V}, \{\beta_{ij}\}_{(i,j) \in E})$
that achieve consistency in the localized estimations, i.e.,

$$\begin{aligned} \text{Marginal Belief Consensus II:} \quad & \beta_i(x) = \beta(x), \quad \forall x \in \mathcal{X}_i,\ \forall i \in V \\ & \beta_{ij}(x, y) = B(x, y), \quad \forall x \in \mathcal{X}_i,\ \forall y \in \mathcal{X}_j,\ \forall (i,j) \in E \end{aligned} \tag{44}$$

which can be rewritten as

$$\begin{aligned} & \beta_i = \beta, \quad \forall i \in V, \ \text{for some } \beta \in \Delta \\ & \beta_{ij} = B, \quad \forall (i,j) \in E, \ \text{for some } B \in \Omega \end{aligned} \tag{45}$$

where $\Omega = \{A \in \mathbb{R}^{n \times n} : Ae = p \text{ and } e'A = q' \text{ for some } p, q \in \Delta\}$.

The use of Bayesian hypothesis tests for FDD needs special attention, since
the hypotheses at the nodes are homogeneous. The pairwise marginal
distributions are required to satisfy the conditions $\beta_{ij}(x, y) = 0$ for
all $x \neq y$ and all $(i,j) \in E$, which implies that the off-diagonal entries of $\beta_{ij}$
are zero for all $(i,j) \in E$, or equivalently, that the matrix $B$ in (45) is a diagonal
matrix.

Incorporating the consensus requirement (45) into (43) results in a
decomposed optimization in which Agent $i$ solves

$$\begin{aligned} \min \quad & -\beta_i' \ln \psi_i - \beta_i' a_i + \beta_i' \ln \beta_i \\ \text{s.t.} \quad & \beta_i = \beta \in \Delta, \end{aligned} \tag{46}$$

where $a_i = \sum_{j \in \mathcal{N}(i)} \ln(\mathrm{diag}[\psi_{ij}])$ and $\mathrm{diag}[A]$ denotes the vector whose elements
are the diagonal entries of $A$ in order. The resultant optimizations (46) are
indeed convex and can be efficiently solved to find global consensus optima
$\beta_i^* = \beta$ for all Agents $i \in V$. In particular, we suggest using dual decomposition
methods (see Appendix A.1).
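
One possible dual decomposition iteration for a consensus problem of this form is sketched below; it is an illustrative assumption and not necessarily the scheme of Appendix A.1. Each agent's objective is taken to be a linear cost plus the negative entropy, so the hypothetical array `costs[i]` stands in for $-\ln \psi_i - a_i$; the step size and iteration count are also arbitrary choices. With prices $\lambda_i$ that sum to zero over the agents, each local problem has a closed-form solution proportional to $\exp(-(c_i + \lambda_i))$, and the prices are raised wherever an agent's belief exceeds the network average.

```python
import math


def softmax_neg(c):
    """Minimizer over the simplex of c'b + b'ln(b): b proportional to exp(-c)."""
    m = min(c)
    w = [math.exp(-(ci - m)) for ci in c]
    s = sum(w)
    return [wi / s for wi in w]


def dual_decomposition(costs, step=0.5, iters=500):
    """Consensus by dual ascent; costs[i] is Agent i's linear cost vector."""
    n_agents, n = len(costs), len(costs[0])
    lam = [[0.0] * n for _ in range(n_agents)]   # prices of disagreement
    beta = []
    for _ in range(iters):
        # Local (slave) problems: each agent uses only its own cost and price.
        beta = [softmax_neg([c + l for c, l in zip(costs[i], lam[i])])
                for i in range(n_agents)]
        mean = [sum(b[x] for b in beta) / n_agents for x in range(n)]
        # Price update: gradient ascent on the dual, keeping the prices
        # summing to zero across agents.
        for i in range(n_agents):
            for x in range(n):
                lam[i][x] += step * (beta[i][x] - mean[x])
    return beta, lam


# Hypothetical costs for three agents and three hypotheses.
costs = [[0.2, 1.0, 0.5], [0.9, 0.1, 0.4], [0.3, 0.6, 1.2]]
beta, lam = dual_decomposition(costs)

# For this objective the consensus optimum is known in closed form:
# it is proportional to exp(-average cost).
avg_cost = [sum(c[x] for c in costs) / len(costs) for x in range(3)]
target = softmax_neg(avg_cost)
print("agent beliefs:", beta)
print("closed-form consensus:", target)
```

In a networked implementation the average would itself have to be obtained through neighbor-to-neighbor exchange; the sketch computes it centrally only to keep the price dynamics visible.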

Remark 3. In the aforementioned constrained optimization problems, the as¬ 
sociated Lagrangian multipliers can be considered as prices of disagreement be¬ 
tween agents (i.e., local beliefs). The gradient dynamics of primal (the belief 
states) and dual variables (the prices of disagreement) should be explicitly writ¬ 
ten and interpreted in terms of convergence rate, optimality, monotonicity, etc. 


5 Discussion 

This section discusses several issues on the use of belief propagation for dis¬ 
tributed statistical inference. We also present some open questions that are not 
fully answered in this chapter. The purpose of these discussions is to suggest 
future research directions for extensions and applications of BP/BO methods. 

5.1 Open Problems 

For proper usage of belief propagation and optimization to tackle distributed 
statistical inference problems, some underlying assumptions of BP/BO methods 
need to be further investigated. 

5.1.1 Correlated measurements 

Most research in the belief propagation literature assumes that each
local measurement is conditionally independent of the other states at $V$, given
the local state (even given the states at its neighborhood). In other words, the
likelihood functions have the relations

$$p(y_k \mid x_k, x_{-k}) = p(y_k \mid x_k), \quad \forall k \in V. \tag{47}$$

This assumption would be valid only for some special cases, such as when the
sensors are static (memoryless) and each source of uncertainty is localized. To
see the role of this assumption of conditional independence in belief propagation,
consider the next example of sensor fusion.

Example 1. Consider the Markov network model of sensor fusion depicted in
Figure 1. Messages from Agents 2 and 3 to Agent 1 are computed by

$$\mu_{j \to 1}(x_1) \propto \sum_{x_j} p(x_1 \mid x_j)\, p(x_j \mid y_j), \quad \text{for } j = 2, 3, \tag{48}$$

where $\{y_j\}$ are the local measurements that are available to Agents $j$. Note that
this is indeed a marginalization and results in

$$\mu_{j \to 1}(x_1) \propto p(x_1 \mid y_j), \quad \text{for } j = 2, 3, \tag{49}$$

and the resultant belief is

$$\beta_1(x_1) \propto p(x_1 \mid y_1)\, \mu_{2 \to 1}(x_1)\, \mu_{3 \to 1}(x_1) \propto p(x_1 \mid y_1)\, p(x_1 \mid y_2)\, p(x_1 \mid y_3). \tag{50}$$

Under the assumption of conditional independence (47), the belief can be
rewritten as

$$\beta_1(x_1) \propto p(x_1 \mid y_1, y_2, y_3) \tag{51}$$

that is the marginal probability of the state of Agent 1 for given total
measurements. The marginal probabilities of Agents 2 and 3 can be computed in similar
ways, viz., $\beta_2(x_2) \propto p(x_2 \mid y_1, y_2, y_3)$ and $\beta_3(x_3) \propto p(x_3 \mid y_1, y_2, y_3)$.

In the previous example, notice that without assuming or guaranteeing
conditional independence, the messages $\mu_{j \to i}(x_i)$ for $i \neq j = 1, 2, 3$ result in the
beliefs $\beta_i(x_i) \propto \prod_{j=1}^{3} p(x_i \mid y_j)$, which are not the same as the desired relations
$\beta_i(x_i) \propto p(x_i \mid y_1, y_2, y_3)$.
Fortunately, for the case of homogeneous hypotheses in graphical models,
the likelihood functions (47) have the relations

$$p(y_k \mid x_k, x_{-k}) = p(y_k \mid x_k) \prod_{j=1}^{n} \delta(x_k, x_j), \quad \forall k \in V, \tag{52}$$

where $\delta(x, y)$ refers to the standard scalar delta function, which equals one if
$x = y$ and zero otherwise. This implies that
for Example 1, the message-passing algorithms (48) achieve the correct beliefs
(51) only if they satisfy the additional conditions of marginal belief consensus,
viz., $\beta_1(x) = \beta_2(x) = \beta_3(x)$ for all $x \in \mathcal{X}$.

Figure 1: A schematic cartoon of a Markov network for sensor fusion. The solid
arrows correspond to communication links and the dotted arrows correspond to
measurement mechanism. S: Sensor, P: Processor, R: Receiver, T: Transmitter,
and E: Evidence (or Observational Event).


5.1.2 Pre vs. Post data processing and information fusion 

The primary goal of message-passing algorithms is to provide sufficient statistics 
for computations of marginal probabilities. In the context of belief propagation, 
sufficient statistics of messages are properties that ensure the relations 

Pi{*i\yi, = p(x,;|P = {%}"= i) , Vx* £ Xi, V* = 1, • • ■ ,n. (53) 

Message-passing algorithms can be considered as post data processing for information fusion, whereas transmitting raw data, not subject to any data processing, is a naive method for computing marginal probabilities. Because of communication bandwidth limitations and the cost of data storage, raw-data transmission is neither practical nor efficient.

In belief propagation algorithms based on graphical network models, reducing communication costs is of primary importance. Reducing the size of the transmitted messages while guaranteeing exactness of the resulting statistical inference amounts to computing the smallest (minimal) sufficient statistics.
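The trade-off between raw-data transmission and sufficient-statistic messages can be illustrated with a conjugate Gaussian model. This is a sketch under assumptions that are not in the paper: a scalar unknown mean with a Gaussian prior, known noise variance, and hypothetical sample sizes per agent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model: unknown scalar mean x with prior N(0, tau2); each agent
# observes i.i.d. samples y = x + noise, where noise ~ N(0, sigma2).
tau2, sigma2 = 4.0, 1.0
x_true = 1.5
raw_data = [x_true + rng.normal(0.0, np.sqrt(sigma2), size=m) for m in (50, 80, 20)]


def gaussian_posterior(count, total):
    """Posterior mean and variance of x given the sample count and sample sum."""
    precision = 1.0 / tau2 + count / sigma2
    return (total / sigma2) / precision, 1.0 / precision


# Naive fusion: every agent transmits all of its raw samples.
pooled = np.concatenate(raw_data)
post_raw = gaussian_posterior(pooled.size, pooled.sum())

# Message-based fusion: each agent transmits only (count, sum), two numbers.
messages = [(y.size, y.sum()) for y in raw_data]
post_msg = gaussian_posterior(sum(c for c, _ in messages),
                              sum(s for _, s in messages))

numbers_sent_raw = pooled.size        # one scalar per raw sample
numbers_sent_msg = 2 * len(messages)  # two scalars per agent
print(post_raw, post_msg, numbers_sent_raw, numbers_sent_msg)
```

Both routes give the same posterior, but the message-based route sends a fixed number of scalars per agent regardless of how many samples each agent collects.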

5.1.3 Suboptimality of consensus algorithms 

Much research effort has studied the convergence of message-passing algorithms in terms of properties of the graph G = (V, E) (see [1,13,17], for example). However, convergence does not imply optimality in general. Furthermore, such suboptimality can result in an arbitrarily bad decision whenever the estimation problem is connected to an optimal control problem, in which an inaccurate belief can drive the resultant decision away from the optimal one so that the achieved performance is significantly worse. For example, in [17], the average-consensus algorithm and a belief propagation method are combined; such an algorithm was referred to as belief consensus. Belief consensus has many benefits, such as scalability and convergence under varying network topology. However, the authors did not provide any analysis of the optimality or suboptimality of their methods for distributed hypothesis tests.* Notice that convergence or consensus of beliefs or messages does not necessarily imply optimality of the resultant hypothesis testing.
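The gap between convergence and optimality can be seen in a small numerical experiment. The sketch below uses a deliberately naive scheme, linear averaging of local probability vectors over a path graph; it is not the belief consensus algorithm of [17], and the belief values, graph, and weights are illustrative assumptions.

```python
import numpy as np

# Hypothetical local beliefs over two hypotheses held by three agents
# (a uniform prior and conditionally independent measurements are assumed).
local_beliefs = np.array([
    [0.80, 0.20],
    [0.80, 0.20],
    [0.05, 0.95],
])

# Average consensus on the path graph 1-2-3 with weights W = I - eps * Laplacian.
laplacian = np.array([
    [1.0, -1.0, 0.0],
    [-1.0, 2.0, -1.0],
    [0.0, -1.0, 1.0],
])
weights = np.eye(3) - 0.3 * laplacian

beliefs = local_beliefs.copy()
for _ in range(200):
    beliefs = weights @ beliefs  # each agent mixes its belief with its neighbors'

# Every agent decides from the common limit of the iteration.
consensus_decision = beliefs.argmax(axis=1)

# Centralized Bayesian posterior: normalized product of the local beliefs.
posterior = local_beliefs.prod(axis=0)
posterior /= posterior.sum()
bayes_decision = posterior.argmax()

print(beliefs[0], consensus_decision, posterior, bayes_decision)
```

All agents converge to the same averaged belief and agree on the first hypothesis, while the centralized posterior favors the second one, so consensus is reached on a suboptimal decision.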

5.2 MAP Consensus 

In Section 4, belief consensus constraints (conditions (37) for the mean-field energy minimization and conditions (44) for the Bethe free energy minimization) are incorporated into belief optimization to reach agreement in the marginal and pairwise marginal probability distributions of multiple hypotheses for given total measurements.

A popular statistical inference problem is to find the most probable state under a probability distribution for given measurements. For graphical models of distributed hypothesis testing, such a state can be obtained from m-MAP or j-MAP estimation. Recall that an m-MAP estimator finds the state variables associated with the nodes in a graphical model such that the corresponding marginal posterior probabilities are maximized for given total measurements. Similarly, but slightly differently, a j-MAP estimator finds a configuration of state variables in a graphical model such that the corresponding joint posterior probability is maximized for given total measurements. For this purpose, the aforementioned max-product BP algorithms can be beneficial: using max-product BP can reduce the communication costs, while the computational burden on local processors would increase.
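The distinction between the two estimators matters because they can select different configurations. A minimal brute-force sketch, assuming a hypothetical joint posterior over two binary node states (the table values are illustrative only):

```python
import numpy as np

# Hypothetical joint posterior p(x1, x2 | y) over two binary node states;
# rows index x1 and columns index x2.
joint = np.array([
    [0.4, 0.0],
    [0.3, 0.3],
])

# j-MAP: the single configuration that maximizes the joint posterior.
j_map = tuple(int(i) for i in np.unravel_index(joint.argmax(), joint.shape))

# m-MAP: each node maximizes its own marginal posterior.
marginal_x1 = joint.sum(axis=1)  # p(x1 | y)
marginal_x2 = joint.sum(axis=0)  # p(x2 | y)
m_map = (int(marginal_x1.argmax()), int(marginal_x2.argmax()))

print(j_map, joint[j_map], m_map, joint[m_map])
```

Here the node-wise maximizers form a configuration whose joint probability is lower than that of the j-MAP configuration, which is why the choice of estimator should follow the decision problem at hand.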


6 Summary and Future Work 

This paper has developed methods of distributed Bayesian hypothesis testing, particularly for applications to distributed fault detection and diagnosis in large-scale networked systems. The presented methods are based on belief propagation and optimization and use graphical models to represent the systems under consideration. The resultant estimation problems reduce to distributed optimization problems, for which the idea of belief optimization is adopted: free energy is minimized to find an optimal probabilistic configuration of the state variables in Markov random fields. For distributed computation of the associated constrained minimization problems, dual decomposition methods are used, which provide the benefits of scalability and convergence.

* Performance of a consensus algorithm can be arbitrarily bad: convergence vs. optimality.

Several issues concerning the efficient and proper use of belief propagation and optimization for distributed statistical inference problems are discussed. Future research directions include (a) further generalizing belief optimization using region-based free energy representations, which extend pairwise potential energy descriptions, and (b) evaluating the exactness and computing the approximation errors of an estimator obtained by minimizing an approximate free energy, to name a few.

A Decomposition Algorithms 

A.1 Iterative Dual Decomposition

This appendix primarily considers two problems. The first problem is a standard form of decomposable optimization with linear consistency (or complicating) constraints and the second problem is its variation in which the local payoff functions are unequally weighted.

Problem 2. Consider an optimization with the separable payoff function, separable constraints, and equality consistency constraints of the form

$$\begin{aligned}
\text{maximize} \quad & J(\mathbf{x}, \mathbf{y}) = \sum_{k=1}^{N} \ell_k(x_k, y_k) \\
\text{subject to} \quad & (x_k, y_k) \in \mathcal{F}_k, \quad k = 1, \ldots, N, \\
& y_k = C_k z, \quad k = 1, \ldots, N,
\end{aligned} \tag{54}$$

where $(x_k, y_k)$ is the $k$th pair of separable decision variables that correspond to the separated convex cost functions $\ell_k(x_k, y_k)$, $\mathcal{F}_k$ denotes the $k$th constraint set for the separated decision variable pair $(x_k, y_k)$, and $y_k = C_k z$ for $k = 1, \ldots, N$ are the consistency constraints, which are the only coupled constraints over the separated decision variables.

A.1.1 Lagrangian method and decomposition

Consider the optimization (54). An associated augmented Lagrangian to relax 
the consistency constraint is given by 


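$$\begin{aligned}
L(\mathbf{x}, \mathbf{y}, z, \mathbf{v}) &= \sum_{k=1}^{N} \ell_k(x_k, y_k) - \sum_{k=1}^{N} \langle v_k, y_k - C_k z \rangle \\
&= \sum_{k=1}^{N} \ell_k(x_k, y_k) - \sum_{k=1}^{N} \langle v_k, y_k \rangle + \sum_{k=1}^{N} \langle v_k, C_k z \rangle \\
&= \sum_{k=1}^{N} \bigl( \ell_k(x_k, y_k) - \langle v_k, y_k \rangle \bigr) + \sum_{k=1}^{N} \langle C_k^{*} v_k, z \rangle,
\end{aligned} \tag{55}$$

where $\mathbf{x} = [x_1, \ldots, x_N]'$, $\mathbf{y} = [y_1, \ldots, y_N]'$, $\mathbf{v} = [v_1, \ldots, v_N]'$,† and the superscript $*$ refers to the associated adjoint operator.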

Finding a saddle point that is a global optimal solution requires solving the 
two-stage optimization 


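$$\inf_{\mathbf{v}} \ \sup_{(\mathbf{x}, \mathbf{y}) \in \mathcal{F},\, z} L(\mathbf{x}, \mathbf{y}, z, \mathbf{v}), \tag{56}$$

where $\mathcal{F} = \mathcal{F}_1 \times \cdots \times \mathcal{F}_N$ refers to the product set of local (i.e., subsystem) constraints. From (55), the optimization (56) can be rewritten as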


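$$\begin{aligned}
& \inf_{\mathbf{v}} \ \sup_{(\mathbf{x}, \mathbf{y}) \in \mathcal{F},\, z} \left( \sum_{k=1}^{N} \bigl( \ell_k(x_k, y_k) - \langle v_k, y_k \rangle \bigr) + \sum_{k=1}^{N} \langle C_k^{*} v_k, z \rangle \right) \\
&\quad = \inf_{\mathbf{v}} \begin{cases} \displaystyle \sup_{(\mathbf{x}, \mathbf{y}) \in \mathcal{F}} \sum_{k=1}^{N} \bigl( \ell_k(x_k, y_k) - \langle v_k, y_k \rangle \bigr) & \text{if } C^{*} \mathbf{v} = 0, \\ +\infty & \text{otherwise} \end{cases} \\
&\quad = \inf_{C' \mathbf{v} = 0} \ \sup_{(\mathbf{x}, \mathbf{y}) \in \mathcal{F}} \sum_{k=1}^{N} \bigl( \ell_k(x_k, y_k) - \langle v_k, y_k \rangle \bigr) \\
&\quad = \inf_{C' \mathbf{v} = 0} \ \sum_{k=1}^{N} \sup_{(x_k, y_k) \in \mathcal{F}_k} \bigl( \ell_k(x_k, y_k) - \langle v_k, y_k \rangle \bigr),
\end{aligned} \tag{57}$$

where $C' = [C_1', \ldots, C_N']$, so that $C' \mathbf{v} = \sum_{k=1}^{N} C_k' v_k$ (for real matrices, the adjoint $C_k^{*}$ is the transpose $C_k'$). The case distinction arises because the supremum over the unconstrained variable $z$ of a linear function is $+\infty$ unless its coefficient vanishes.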

The optimization (56) can be decomposed into two convex programs: 

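$$\text{Slave Problem:} \quad \sup_{(x_k, y_k) \in \mathcal{F}_k} \underbrace{\bigl( \ell_k(x_k, y_k) - \langle v_k, y_k \rangle \bigr)}_{S_k(x_k, y_k \mid v_k)}, \tag{58}$$

for $k = 1, \ldots, N$, and

$$\text{Master Problem:} \quad \inf_{C' \mathbf{v} = 0} \ \sum_{k=1}^{N} \underbrace{S_k(x_k^{*}, y_k^{*} \mid v_k)}_{Q_k(v_k)}, \tag{59}$$

where $(x_k^{*}, y_k^{*})$ refers to the optimal solution pair for the Slave Problem (58) for given $v_k$.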


† The bold refers to global variables while the non-bold refers to local variables.


A.1.2 Projection-(Sub)gradient method

The Master Problem (59) can be solved using a first-order (sub-)gradient projection method, whereas the Slave Problem (58) is of much smaller size and can be accurately and efficiently solved by a second-order method such as an interior-point algorithm [2,16]. Consider that the Master Problem (59) can be rewritten as

$$\inf_{C' \mathbf{v} = 0} \ \sum_{k=1}^{N} Q_k(v_k), \tag{60}$$

where $Q_k(v_k)$ is convex in $v_k$ for all $k = 1, \ldots, N$. Define a linear subspace $\mathcal{M} = \{\mathbf{v} \in \mathbb{R}^{\bullet} : C' \mathbf{v} = 0\}$, which is the null-space of the matrix $C'$. The optimization (60) can be solved using a subgradient-projection method:

$$\mathbf{v}^{(n+1)} := \mathcal{P}_{\mathcal{M}}\bigl( \mathbf{v}^{(n)} - \alpha_n g^{(n)}(\mathbf{v}^{(n)}) \bigr), \tag{61}$$

where $\mathcal{P}_{\mathcal{M}} : \mathbb{R}^{\bullet} \to \mathcal{M}$ refers to the projection onto the subspace $\mathcal{M}$, $g^{(n)} : \mathbb{R}^{\bullet} \to \mathbb{R}^{\bullet}$ denotes the subgradient, i.e., $g \in \partial_{\mathbf{v}} \sum_{k} Q_k(v_k)$, and $\alpha_n$ is a step size that can be selected in any of the standard ways (e.g., constant, diminishing, etc.). The sub-differential can be represented as

$$\partial_{\mathbf{v}} \left( \sum_{k=1}^{N} Q_k(v_k) \right) = \partial_{v_1} Q_1(v_1) \times \cdots \times \partial_{v_N} Q_N(v_N) \ni -\mathbf{y}^{*}, \tag{62}$$

where $\mathbf{y}^{*}$ denotes the concatenation of optimal solutions of the Slave Problem (58) for a given sequence $\{v_k\}$. In other words, local subsystems are required to sequentially report the computed public variables to the supervisor (or price-planner). Therefore, the update rule for the subgradient-projection method (61) can be rewritten as

$$\mathbf{v}^{(n+1)} := \mathcal{P}_{\mathcal{M}}\bigl( \mathbf{v}^{(n)} + \alpha_n \mathbf{y}^{(n)} \bigr), \tag{63}$$

where the superscript $*$ of $\mathbf{y}$ is removed for notational convenience. Furthermore, it is not hard to see that

$$\mathcal{P}_{\mathcal{M}}(z) = \bigl( I - C (C' C)^{-1} C' \bigr) z, \tag{64}$$

so that

$$\begin{aligned}
\mathbf{v}^{(n+1)} &:= \bigl( I - C (C' C)^{-1} C' \bigr) \bigl( \mathbf{v}^{(n)} + \alpha_n \mathbf{y}^{(n)} \bigr) \\
&\phantom{:}= \mathbf{v}^{(n)} + \alpha_n \underbrace{\bigl( I - C (C' C)^{-1} C' \bigr)}_{U} \mathbf{y}^{(n)},
\end{aligned} \tag{65}$$

where the second line uses $\mathcal{P}_{\mathcal{M}}(\mathbf{v}^{(n)}) = \mathbf{v}^{(n)}$ for $\mathbf{v}^{(n)} \in \mathcal{M}$. The computation of the matrix $U$ needs to be performed only once and can be done offline (before performing optimization).

Figure 2: Iterative dual decomposition of sequentially reporting public variables and assigning prices. The superscript (n) refers to the iteration sequence.
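The price-update iteration (65) can be sketched on a small consensus problem. The instance below is hypothetical and chosen so that the slave problems have closed-form solutions: there are no private variables $x_k$, the payoffs are $\ell_k(y_k) = -\tfrac{1}{2}(y_k - a_k)^2$, the local constraint sets are unbounded, and $C_k = 1$; the values of `a` and the step size `alpha` are illustrative assumptions.

```python
import numpy as np

# Hypothetical instance of Problem 2 without private variables x_k:
# l_k(y_k) = -0.5 * (y_k - a_k)**2 and C_k = 1, so every y_k must equal z.
a = np.array([1.0, 4.0, 7.0])
N = a.size
C = np.ones((N, 1))  # stacked consistency matrix, y = C z

# Projection onto M = {v : C'v = 0}, computed once offline, cf. (64)-(65).
U = np.eye(N) - C @ np.linalg.inv(C.T @ C) @ C.T


def solve_slaves(v):
    """Closed-form slave solutions y_k* = argmax_y (l_k(y) - v_k * y), cf. (58)."""
    return a - v


v = np.zeros(N)  # initial prices, which already lie in M
alpha = 0.5      # constant step size (an assumption of this sketch)
for _ in range(100):
    y = solve_slaves(v)    # subsystems report their public variables
    v = v + alpha * U @ y  # supervisor updates the prices, cf. (65)

y = solve_slaves(v)
print(y, v)
```

For this instance the public variables agree on the average of `a`, which is the maximizer of the coupled problem, and the prices sum to zero as required by the constraint $C'\mathbf{v} = 0$.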


A.1.3 Separable cost with coupled inequalities 


Problem 3. Consider an optimization with the separable payoff function, separable constraints, and coupled inequality constraints of the form

$$\begin{aligned}
\text{maximize} \quad & J(\mathbf{x}) = \sum_{k=1}^{N} \ell_k(x_k) \\
\text{subject to} \quad & x_k \in \mathcal{F}_k, \quad k = 1, \ldots, N, \\
& C \mathbf{x} \geq 0,
\end{aligned} \tag{66}$$

where $x_k$ is the $k$th separable decision variable that corresponds to the separated convex cost function $\ell_k(x_k)$, $\mathcal{F}_k$ denotes the $k$th constraint set for $x_k$, and $C \mathbf{x} = \sum_{k=1}^{N} C_k x_k \geq 0$ are the coupled inequality constraints.

An associated augmented Lagrangian is 


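$$\begin{aligned}
L(\mathbf{x}, \mathbf{v}) &= \sum_{k=1}^{N} \ell_k(x_k) - \langle \mathbf{v}, C \mathbf{x} \rangle \\
&= \sum_{k=1}^{N} \ell_k(x_k) - \sum_{k=1}^{N} \langle \mathbf{v}, C_k x_k \rangle,
\end{aligned} \tag{67}$$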


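where $\mathbf{v} \geq 0$. The constrained optimization (66) can be decomposed into the two-stage optimization

$$\begin{aligned}
\inf_{\mathbf{v} \geq 0} \ \sup_{\mathbf{x} \in \mathcal{F}} L(\mathbf{x}, \mathbf{v}) &= \inf_{\mathbf{v} \geq 0} \ \sup_{\mathbf{x} \in \mathcal{F}} \left( \sum_{k=1}^{N} \ell_k(x_k) - \sum_{k=1}^{N} \langle \mathbf{v}, C_k x_k \rangle \right) \\
&= \inf_{\mathbf{v} \geq 0} \ \sum_{k=1}^{N} \underbrace{\sup_{x_k \in \mathcal{F}_k} \bigl( \ell_k(x_k) - \langle \mathbf{v}, C_k x_k \rangle \bigr)}_{Q_k(\mathbf{v})},
\end{aligned} \tag{68}$$

where $\mathcal{F} = \mathcal{F}_1 \times \cdots \times \mathcal{F}_N$ refers to the product set of local (i.e., subsystem) constraints. The optimization (68) can be decomposed into two convex programs: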

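$$\text{Slave Problem:} \quad \sup_{x_k \in \mathcal{F}_k} \underbrace{\bigl( \ell_k(x_k) - \langle \mathbf{v}, C_k x_k \rangle \bigr)}_{S_k(x_k \mid \mathbf{v})}, \tag{69}$$

for $k = 1, \ldots, N$, and

$$\text{Master Problem:} \quad \inf_{\mathbf{v} \geq 0} \ \sum_{k=1}^{N} \underbrace{S_k(x_k^{*} \mid \mathbf{v})}_{Q_k(\mathbf{v})}, \tag{70}$$

where $x_k^{*}$ refers to the optimal solution of the Slave Problem (69) for given $\mathbf{v}$. A projection-subgradient method similar to the one described above can be used to solve this problem.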

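Projection-(Sub)gradient Method: Starting from a feasible dual variable $\mathbf{v}^{(0)} \geq 0$, the sequences of primal-dual solutions can be computed as follows:

$$x_k^{(n)} := \arg\max_{x_k \in \mathcal{F}_k} \bigl( \ell_k(x_k) - \langle \mathbf{v}^{(n)}, C_k x_k \rangle \bigr), \tag{71}$$

and

$$\mathbf{v}^{(n+1)} := \left( \mathbf{v}^{(n)} + \alpha_n \sum_{k=1}^{N} C_k x_k^{(n)} \right)_{+}, \tag{72}$$

where $(a)_{+}$ has the $i$th element defined as $a_i$ if $a_i > 0$ and 0 otherwise.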
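The iteration (71)-(72) can be sketched on a two-agent toy problem. The instance is hypothetical: quadratic payoffs $\ell_k(x_k) = -\tfrac{1}{2}(x_k - a_k)^2$, box constraints $\mathcal{F}_k = [-5, 5]$, and $C_k = 1$. The sketch assumes the coupling constraint is oriented as $\sum_k C_k x_k \leq 0$, the orientation under which the update (72) with $\mathbf{v} \geq 0$ raises the price whenever the coupled quantity is positive; for the opposite orientation, $C_k$ would be replaced by $-C_k$.

```python
import numpy as np

# Hypothetical instance: l_k(x_k) = -0.5 * (x_k - a_k)**2, F_k = [-5, 5],
# C_k = 1, with the coupling constraint taken as sum_k C_k x_k <= 0.
a = np.array([2.0, 1.0])
C = np.ones((1, a.size))  # column k of C is C_k
lower, upper = -5.0, 5.0


def solve_slaves(v):
    """Slave solutions x_k = argmax over F_k of l_k(x_k) - <v, C_k x_k>, cf. (71)."""
    return np.clip(a - C.T @ v, lower, upper)


v = np.zeros(1)  # feasible initial multiplier, v >= 0
alpha = 0.2      # constant step size (an assumption of this sketch)
for _ in range(200):
    x = solve_slaves(v)
    v = np.maximum(v + alpha * (C @ x), 0.0)  # projected price update, cf. (72)

x = solve_slaves(v)
print(x, v)
```

The iterates settle at the point where the coupled quantity $\sum_k C_k x_k$ is zero, which is the projection of `a` onto the half-space defined by the assumed constraint.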